Contents

1 RDD(Resilient Distributed Dataset)
2 Transformation
2.1 textFile()
2.2 map()
2.3 union()
2.4 filter()
3 Action
3.1 take()
3.2 collect()

Work in progress..

When using pyspark, prefer the built-in functions.
Otherwise, overhead is incurred moving data between the JVM and Python.

1 RDD (Resilient Distributed Dataset)

  • A distributed, immutable collection of objects
  • Split into partitions that are spread across the nodes of a cluster

2 Transformation

Creates a new RDD. However, it is not executed immediately; it runs only when an "action" is invoked.

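2.1 textFile()

Create an RDD from the file c:\data\test.txt, requesting at least 2 partitions (the second argument is the minimum number of partitions).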
lines = sc.textFile("c:\\data\\test.txt", 2)
lines.collect()


2.2 map()


2.3 union()

Merging two RDDs, then three RDDs
lines1 = sc.parallelize(['a', 'b', 'c'])
lines2 = sc.parallelize(['d', 'e', 'f'])
lines3 = sc.parallelize(['g', 'h', 'i'])
lines = lines1.union(lines2).union(lines3)
for line in lines.collect():
	print(line)

Result
>>> lines1 = sc.parallelize(['a', 'b', 'c'])
>>> lines2 = sc.parallelize(['d', 'e', 'f'])
>>> lines3 = sc.parallelize(['g', 'h', 'i'])
>>> lines = lines1.union(lines2).union(lines3)
>>> for line in lines.collect():
...     print(line)
...
a
b
c
d
e
f
g
h
i
>>>

2.4 filter()

Keep only the elements that contain "추".

# Vegetable names: eggplant, radish, napa cabbage, lettuce
lines = sc.parallelize(['가지', '무', '배추', '상추'])
choo = lines.filter(lambda x: "추" in x)
choo.collect()

Result
>>> lines = sc.parallelize(['가지', '무', '배추', '상추'])
>>> choo = lines.filter(lambda x: "추" in x)
>>> choo.collect()
['배추', '상추']
>>>



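3 Action

An action runs a computation on an RDD and returns the result. The result is either returned to the driver program or written to external storage such as HDFS.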

3.1 take()

take(2) fetches the first 2 elements of the RDD.
lines = sc.parallelize(['a', 'b', 'c'])
for line in lines.take(2):
	print(line)

Result
>>> lines = sc.parallelize(['a', 'b', 'c'])
>>> for line in lines.take(2):
...     print(line)
...
a
b
>>>

3.2 collect()

Fetch all elements.

lines = sc.parallelize(['a', 'b', 'c'])
for line in lines.collect():
	print(line)

Result
>>> lines = sc.parallelize(['a', 'b', 'c'])
>>> for line in lines.collect():
...     print(line)
...
a
b
c